forked from grpc/grpc
OR-6597 #1
Open
dupbr01
wants to merge
1
commit into
ActianCorp:master
from
dupbr01:master
+3
−1
Member
dupbr01
commented
Aug 28, 2024
As discussed, this is for initial review before posting upstream; it won't be accepted into the master branch. The commit message needs to include the upstream issue number, something like:
Replace NUMBER with 35989
Or:
which may be more accurate?
mahinpandya
pushed a commit
that referenced
this pull request
Sep 24, 2024
…pc#36561)

This should fix errors of the form - https://source.cloud.google.com/results/invocations/84e6c8cd-78df-45a3-8898-d703a2d38ac5/targets/%2F%2Ftest%2Fcore%2Fend2end:http2_stats_test@poller%3Dpoll/log

```
*** SIGSEGV received at time=1715064982 on cpu 0 ***
PC: @ 0xffffaf404250 (unknown) absl::lts_20240116::Mutex::Lock()
    @ 0xffffb406e818 224 absl::lts_20240116::AbslFailureSignalHandler()
    @ 0xffffb45297b0 4768 (unknown)
    @ 0xffffb0266888 32 grpc_core::DelegatingClientCallTracer::DelegatingClientCallAttemptTracer::RecordEnd()
    @ 0xffffb14de408 64 grpc_core::ClientChannelFilter::FilterBasedLoadBalancedCall::Orphan()
    @ 0xffffb14fd2b0 48 grpc_core::RetryFilter::LegacyCallData::~LegacyCallData()
    @ 0xffffb14fc8e4 32 grpc_core::RetryFilter::LegacyCallData::Destroy()
    @ 0xffffb0ebc5bc 32 grpc_call_stack_destroy()
    @ 0xffffb14f4e34 48 grpc_core::DynamicFilters::Call::Destroy()
    @ 0xffffaff752b0 48 grpc_core::ExecCtx::Flush()
    @ 0xffffb44a6fb0 64 grpc_core::ExecCtx::~ExecCtx()
    @ 0xffffb14f2a90 160 absl::lts_20240116::internal_any_invocable::LocalInvoker<>()
    @ 0xffffb072c2fc 48 grpc_event_engine::experimental::SelfDeletingClosure::Run()
    @ 0xffffb072bd78 32 grpc_event_engine::experimental::WorkStealingThreadPool::ThreadState::Step()
    @ 0xffffb072ba7c 112 grpc_event_engine::experimental::WorkStealingThreadPool::ThreadState::ThreadBody()
    @ 0xffffb072c33c 48 grpc_event_engine::experimental::WorkStealingThreadPool::WorkStealingThreadPoolImpl::StartThread()::$_0::__invoke()
    @ 0xffffafb4bdc0 80 grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix()::{lambda()#1}::__invoke()
    @ 0xffffaef95648 80 start_thread
```

I wasn't able to reproduce this but the fix seems correct.

Internal ref: b/339452200

Closes grpc#36561

COPYBARA_INTEGRATE_REVIEW=grpc#36561 from yashykt:FixHttp2StatsTest 88f2962
PiperOrigin-RevId: 631860860
mahinpandya
pushed a commit
that referenced
this pull request
Sep 24, 2024
grpc#36753 has this ASAN test failure with the following error, which doesn't seem to make sense. ([full log](https://btx.cloud.google.com/invocations/a587e5cc-ca1e-46ed-a3c3-199c581583db/targets))

```
Executing tests from //test/cpp/util:grpc_tool_test@poller=epoll1
-----------------------------------------------------------------------------
=================================================================
==15==ERROR: AddressSanitizer: odr-violation (0x7fcfa2961400):
  [1] size=66 'typeinfo name for google::protobuf::compiler::java::ImmutableExtensionLiteGenerator' external/com_google_protobuf/src/google/protobuf/compiler/java/lite/extension.cc in /b/f/w/bazel-out/k8-fastbuild/bin/test/cpp/util/grpc_tool_test@poller=epoll1.runfiles/com_github_grpc_grpc/test/cpp/util/../../../_solib_k8/libexternal_Scom_Ugoogle_Uprotobuf_Ssrc_Sgoogle_Sprotobuf_Scompiler_Sjava_Slite_Sliblite.so
  [2] size=66 'typeinfo name for google::protobuf::compiler::java::ImmutableExtensionLiteGenerator' external/com_google_protobuf/src/google/protobuf/compiler/java/lite/extension.cc in /b/f/w/bazel-out/k8-fastbuild/bin/test/cpp/util/grpc_tool_test@poller=epoll1.runfiles/com_github_grpc_grpc/test/cpp/util/../../../_solib_k8/libexternal_Scom_Ugoogle_Uprotobuf_Ssrc_Sgoogle_Sprotobuf_Scompiler_Sjava_Slite_Slibfield_Ugenerators.so
These globals were registered at these points:
  [1]:
    #0 0x5626b4ea1488 in __asan_register_globals /tmp/clang-build/src/compiler-rt/lib/asan/asan_globals.cpp:369:3
    #1 0x5626b4ea2559 in __asan_register_elf_globals /tmp/clang-build/src/compiler-rt/lib/asan/asan_globals.cpp:352:3
    #2 0x7fcfa3fa4b99 (/lib64/ld-linux-x86-64.so.2+0x11b99) (BuildId: 7ae2aaae1a0e5b262df913ee0885582d2e327982)
  [2]:
    #0 0x5626b4ea1488 in __asan_register_globals /tmp/clang-build/src/compiler-rt/lib/asan/asan_globals.cpp:369:3
    #1 0x5626b4ea2559 in __asan_register_elf_globals /tmp/clang-build/src/compiler-rt/lib/asan/asan_globals.cpp:352:3
    #2 0x7fcfa3fa4b99 (/lib64/ld-linux-x86-64.so.2+0x11b99) (BuildId: 7ae2aaae1a0e5b262df913ee0885582d2e327982)
==15==HINT: if you don't care about these errors you may set ASAN_OPTIONS=detect_odr_violation=0
SUMMARY: AddressSanitizer: odr-violation: global 'typeinfo name for google::protobuf::compiler::java::ImmutableExtensionLiteGenerator' at external/com_google_protobuf/src/google/protobuf/compiler/java/lite/extension.cc in /b/f/w/bazel-out/k8-fastbuild/bin/test/cpp/util/grpc_tool_test@poller=epoll1.runfiles/com_github_grpc_grpc/test/cpp/util/../../../_solib_k8/libexternal_Scom_Ugoogle_Uprotobuf_Ssrc_Sgoogle_Sprotobuf_Scompiler_Sjava_Slite_Sliblite.so
==15==ABORTING
```

This turned out to be a known issue described at google/sanitizers#1017, and there isn't much to do other than disabling the ODR test. I tried the "-mllvm -asan-use-private-alias=1" option but it didn't change the result, so I went with this approach.

Closes grpc#36756

COPYBARA_INTEGRATE_REVIEW=grpc#36756 from veblush:asan-workaround 622581e
PiperOrigin-RevId: 638421967
mahinpandya
pushed a commit
that referenced
this pull request
Sep 24, 2024
…grpc#37225)

Noticed on a Core End2End test failure https://btx.cloud.google.com/invocations/dc3bf84d-e6ed-4b32-a24c-12489f981e46/targets/%2F%2Ftest%2Fcore%2Fend2end:cancel_with_status_test@poller%3Depoll1;config=56f5b09615e325097b100b58c41171656571290519a83c5d89a6067ef0283d46/log

```
F0000 00:00:1721017820.001684 87 tcp_server_posix.cc:354] Check failed: !s->shutdown
*** Check failure stack trace: ***
    @ 0x7f32578da0e4 absl::lts_20240116::log_internal::LogMessage::SendToLog()
    @ 0x7f32578d9a94 absl::lts_20240116::log_internal::LogMessage::Flush()
    @ 0x7f32578da589 absl::lts_20240116::log_internal::LogMessageFatal::~LogMessageFatal()
    @ 0x7f3257e340a1 tcp_server_unref()
    @ 0x7f3258fcba8e grpc_core::Chttp2ServerListener::ActiveConnection::~ActiveConnection()
    @ 0x7f3258fd19e7 grpc_event_engine::experimental::MemoryAllocator::New<>()::Wrapper::~Wrapper()
    @ 0x7f3258fcc998 grpc_core::Chttp2ServerListener::OnAccept()
    @ 0x7f3257e34962 absl::lts_20240116::internal_any_invocable::LocalInvoker<>()
    @ 0x7f3257da6475 grpc_event_engine::experimental::PosixEngineListenerImpl::AsyncConnectionAcceptor::NotifyOnAccept()::$_1::operator()()
    @ 0x7f3257da4437 grpc_event_engine::experimental::PosixEngineListenerImpl::AsyncConnectionAcceptor::NotifyOnAccept()
    @ 0x7f3257da5fef absl::lts_20240116::base_internal::Callable::Invoke<>()
    @ 0x7f3257dca50a grpc_event_engine::experimental::PosixEngineClosure::Run()
    @ 0x7f3257c9013e grpc_event_engine::experimental::WorkStealingThreadPool::ThreadState::Step()
    @ 0x7f3257c8fe48 grpc_event_engine::experimental::WorkStealingThreadPool::ThreadState::ThreadBody()
    @ 0x7f3257c906df grpc_event_engine::experimental::WorkStealingThreadPool::WorkStealingThreadPoolImpl::StartThread()::$_0::__invoke()
    @ 0x7f32579a106c grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix()::{lambda()#1}::__invoke()
    @ 0x7f3257358609 start_thread
```

grpc#36563 changed the refcounting mechanism incorrectly and we ended up taking a ref on the tcp server outside the critical region, resulting in a time-of-check-to-time-of-use bug, where we could end up reffing the tcp server when it is already 0, i.e., when the listener has already been shutdown. This results in an attempt to destroy the tcp server twice and an eventual crash.

Closes grpc#37225

COPYBARA_INTEGRATE_REVIEW=grpc#37225 from yashykt:FixChttp2Bug bc1e8df
PiperOrigin-RevId: 654850991
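The time-of-check-to-time-of-use problem described in this commit message can be illustrated with a small sketch. This is not the gRPC code: `Listener`, `TryRefServer`, `UnrefServer`, and `Shutdown` are hypothetical names, and the ref count is a plain integer rather than gRPC's refcounting types. The only point is that the shutdown check and the ref are taken under the same lock, so a concurrent shutdown cannot happen between them.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-in for a listener guarding a ref-counted server.
// None of these names come from gRPC; they only illustrate the pattern.
class Listener {
 public:
  // Takes a ref only if the listener is still live. The shutdown check and
  // the increment happen inside one critical region, so Shutdown() cannot
  // run between the check and the use.
  bool TryRefServer() {
    std::lock_guard<std::mutex> lock(mu_);
    if (shutdown_) return false;
    ++server_refs_;
    return true;
  }

  // Drops a ref previously obtained from TryRefServer().
  void UnrefServer() {
    std::lock_guard<std::mutex> lock(mu_);
    assert(server_refs_ > 0);
    --server_refs_;
  }

  // Marks the listener as shut down and drops the listener's own ref.
  // Calling it twice is harmless.
  void Shutdown() {
    std::lock_guard<std::mutex> lock(mu_);
    if (shutdown_) return;
    shutdown_ = true;
    --server_refs_;
  }

  // Current ref count, for inspection in tests.
  int refs() {
    std::lock_guard<std::mutex> lock(mu_);
    return server_refs_;
  }

 private:
  std::mutex mu_;
  bool shutdown_ = false;
  int server_refs_ = 1;  // the listener's own ref
};
```

In the buggy shape, the check and the increment would sit in separate critical regions (or the increment outside any), which is what allows a ref to be taken after the count has already reached zero.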
mahinpandya
pushed a commit
that referenced
this pull request
Sep 24, 2024
Internal bug: b/357864682

A lock ordering inversion was noticed with the following stacks -

```
[mutex.cc : 1418] RAW: Potential Mutex deadlock:
    @ 0x564f4ce62fe5 absl::lts_20240116::DebugOnlyDeadlockCheck()
    @ 0x564f4ce632dc absl::lts_20240116::Mutex::Lock()
    @ 0x564f4be5886c absl::lts_20240116::MutexLock::MutexLock()
    @ 0x564f4be968c5 grpc::internal::OpenTelemetryPluginImpl::RemoveCallback()
    @ 0x564f4cd097b8 grpc_core::RegisteredMetricCallback::~RegisteredMetricCallback()
    @ 0x564f4c1f1216 std::default_delete<>::operator()()
    @ 0x564f4c1f157f std::__uniq_ptr_impl<>::reset()
    @ 0x564f4c1ee967 std::unique_ptr<>::reset()
    @ 0x564f4c352f44 grpc_core::GrpcXdsClient::Orphaned()
    @ 0x564f4c25dad1 grpc_core::DualRefCounted<>::Unref()
    @ 0x564f4c4653ed grpc_core::RefCountedPtr<>::reset()
    @ 0x564f4c463c73 grpc_core::XdsClusterDropStats::~XdsClusterDropStats()
    @ 0x564f4c463d02 grpc_core::XdsClusterDropStats::~XdsClusterDropStats()
    @ 0x564f4c25efa9 grpc_core::UnrefDelete::operator()<>()
    @ 0x564f4c25d5f0 grpc_core::RefCounted<>::Unref()
    @ 0x564f4c25c2d9 grpc_core::RefCountedPtr<>::~RefCountedPtr()
    @ 0x564f4c25b1d8 grpc_core::(anonymous namespace)::XdsClusterImplLb::Picker::~Picker()
    @ 0x564f4c25b240 grpc_core::(anonymous namespace)::XdsClusterImplLb::Picker::~Picker()
    @ 0x564f4c12c71a grpc_core::UnrefDelete::operator()<>()
    @ 0x564f4c1292ac grpc_core::DualRefCounted<>::WeakUnref()
    @ 0x564f4c124fb8 grpc_core::DualRefCounted<>::Unref()
    @ 0x564f4c11f029 grpc_core::RefCountedPtr<>::~RefCountedPtr()
    @ 0x564f4c14e958 grpc_core::(anonymous namespace)::OutlierDetectionLb::Picker::~Picker()
    @ 0x564f4c14e980 grpc_core::(anonymous namespace)::OutlierDetectionLb::Picker::~Picker()
    @ 0x564f4c12c71a grpc_core::UnrefDelete::operator()<>()
    @ 0x564f4c1292ac grpc_core::DualRefCounted<>::WeakUnref()
    @ 0x564f4c124fb8 grpc_core::DualRefCounted<>::Unref()
    @ 0x564f4c11f029 grpc_core::RefCountedPtr<>::~RefCountedPtr()
    @ 0x564f4c26bafc std::pair<>::~pair()
    @ 0x564f4c26bb28 __gnu_cxx::new_allocator<>::destroy<>()
    @ 0x564f4c26b88f std::allocator_traits<>::destroy<>()
    @ 0x564f4c26b297 std::_Rb_tree<>::_M_destroy_node()
    @ 0x564f4c26abfb std::_Rb_tree<>::_M_drop_node()
    @ 0x564f4c26a926 std::_Rb_tree<>::_M_erase()
    @ 0x564f4c26a6f0 std::_Rb_tree<>::~_Rb_tree()
    @ 0x564f4c26a62a std::map<>::~map()
    @ 0x564f4c2691a4 grpc_core::(anonymous namespace)::XdsClusterManagerLb::ClusterPicker::~ClusterPicker()
    @ 0x564f4c2691cc grpc_core::(anonymous namespace)::XdsClusterManagerLb::ClusterPicker::~ClusterPicker()
    @ 0x564f4c12c71a grpc_core::UnrefDelete::operator()<>()
    @ 0x564f4c1292ac grpc_core::DualRefCounted<>::WeakUnref()
[mutex.cc : 1428] RAW: Acquiring absl::Mutex 0x564f4f22ad40 while holding 0x7f939834bb70; a cycle in the historical lock ordering graph has been observed
[mutex.cc : 1432] RAW: Cycle:
[mutex.cc : 1446] RAW: mutex@0x564f4f22ad40 stack:
    @ 0x564f4ce62fe5 absl::lts_20240116::DebugOnlyDeadlockCheck()
    @ 0x564f4ce632dc absl::lts_20240116::Mutex::Lock()
    @ 0x564f4be5886c absl::lts_20240116::MutexLock::MutexLock()
    @ 0x564f4be96124 grpc::internal::OpenTelemetryPluginImpl::AddCallback()
    @ 0x564f4cd096f0 grpc_core::RegisteredMetricCallback::RegisteredMetricCallback()
    @ 0x564f4c1f111b std::make_unique<>()
    @ 0x564f4c3564b0 grpc_core::GlobalStatsPluginRegistry::StatsPluginGroup::RegisterCallback<>()
    @ 0x564f4c352dea grpc_core::GrpcXdsClient::GrpcXdsClient()
    @ 0x564f4c355bc6 grpc_core::MakeRefCounted<>()
    @ 0x564f4c3525f2 grpc_core::GrpcXdsClient::GetOrCreate()
    @ 0x564f4c28f8f8 grpc_core::(anonymous namespace)::XdsResolver::StartLocked()
    @ 0x564f4c2f5f82 grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::StartXdsResolver()
    @ 0x564f4c2f515d grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::ZoneQueryDone()
    @ 0x564f4c2f496b grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::StartLocked()::{lambda()#1}::operator()()::{lambda()#1}::operator()()
    @ 0x564f4c2f80f6 std::__invoke_impl<>()
    @ 0x564f4c2f7b9d _ZSt10__invoke_rIvRZZN9grpc_core12_GLOBAL__N_124GoogleCloud2ProdResolver11StartLockedEvENUlNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEN4absl12lts_202401168StatusOrIS8_EEE_clES8_SC_EUlvE_J...
    @ 0x564f4c2f748c std::_Function_handler<>::_M_invoke()
    @ 0x564f4b8ad682 std::function<>::operator()()
    @ 0x564f4cd1c6bf grpc_core::WorkSerializer::LegacyWorkSerializer::Run()
    @ 0x564f4cd1dae4 grpc_core::WorkSerializer::Run()
    @ 0x564f4c2f4b0b grpc_core::(anonymous namespace)::GoogleCloud2ProdResolver::StartLocked()::{lambda()#1}::operator()()
    @ 0x564f4c2f8dc7 absl::lts_20240116::base_internal::Callable::Invoke<>()
    @ 0x564f4c2f8cb8 absl::lts_20240116::base_internal::invoke<>()
    @ 0x564f4c2f8b16 absl::lts_20240116::internal_any_invocable::InvokeR<>()
    @ 0x564f4c2f8a0c absl::lts_20240116::internal_any_invocable::LocalInvoker<>()
    @ 0x564f4c2fb88d absl::lts_20240116::internal_any_invocable::Impl<>::operator()()
    @ 0x564f4c2fb1f3 grpc_core::GcpMetadataQuery::OnDone()
    @ 0x564f4cd75a72 exec_ctx_run()
    @ 0x564f4cd75ba9 grpc_core::ExecCtx::Flush()
    @ 0x564f4cc8ee1d end_worker()
    @ 0x564f4cc8f304 pollset_work()
    @ 0x564f4cc5dcaf pollset_work()
    @ 0x564f4cc69220 grpc_pollset_work()
    @ 0x564f4cbe7733 cq_pluck()
    @ 0x564f4cbe7ad5 grpc_completion_queue_pluck
    @ 0x564f4bc61d96 grpc::CompletionQueue::Pluck()
    @ 0x564f4bfdb055 grpc::ClientReader<>::ClientReader<>()
    @ 0x564f4bfd6035 grpc::internal::ClientReaderFactory<>::Create<>()
    @ 0x564f4bfc322b google::storage::v2::Storage::Stub::ReadObjectRaw()
    @ 0x564f4bf9934b google::storage::v2::Storage::Stub::ReadObject()
[mutex.cc : 1446] RAW: mutex@0x7f939834bb70 stack:
    @ 0x564f4ce62fe5 absl::lts_20240116::DebugOnlyDeadlockCheck()
    @ 0x564f4ce632dc absl::lts_20240116::Mutex::Lock()
    @ 0x564f4be5886c absl::lts_20240116::MutexLock::MutexLock()
    @ 0x564f4c1ce9eb grpc_core::(anonymous namespace)::RlsLb::RlsLb()::{lambda()#1}::operator()()
    @ 0x564f4c1e794c absl::lts_20240116::base_internal::Callable::Invoke<>()
    @ 0x564f4c1e72c1 absl::lts_20240116::base_internal::invoke<>()
    @ 0x564f4c1e6af1 absl::lts_20240116::internal_any_invocable::InvokeR<>()
    @ 0x564f4c1e5d6c absl::lts_20240116::internal_any_invocable::LocalInvoker<>()
    @ 0x564f4be9d0c8 absl::lts_20240116::internal_any_invocable::Impl<>::operator()()
    @ 0x564f4be9b4ff grpc_core::RegisteredMetricCallback::Run()
    @ 0x564f4bea07ae grpc::internal::OpenTelemetryPluginImpl::CallbackGaugeState<>::CallbackGaugeCallback()
    @ 0x564f4bf844de opentelemetry::v1::sdk::metrics::ObservableRegistry::Observe()
    @ 0x564f4bf56529 opentelemetry::v1::sdk::metrics::Meter::Collect()
    @ 0x564f4bf8c1d5 opentelemetry::v1::sdk::metrics::MetricCollector::Collect()::{lambda()#1}::operator()()
    @ 0x564f4bf8c5ac opentelemetry::v1::nostd::function_ref<>::BindTo<>()::{lambda()#1}::operator()()
    @ 0x564f4bf8c5e8 opentelemetry::v1::nostd::function_ref<>::BindTo<>()::{lambda()#1}::_FUN()
    @ 0x564f4bf7604d opentelemetry::v1::nostd::function_ref<>::operator()()
    @ 0x564f4bf74ad9 opentelemetry::v1::sdk::metrics::MeterContext::ForEachMeter()
    @ 0x564f4bf8c457 opentelemetry::v1::sdk::metrics::MetricCollector::Collect()
    @ 0x564f4bf4a7fe opentelemetry::v1::sdk::metrics::MetricReader::Collect()
    @ 0x564f4bed5e24 opentelemetry::v1::exporter::metrics::PrometheusCollector::Collect()
    @ 0x564f4bef004f prometheus::detail::CollectMetrics()
    @ 0x564f4beec26d prometheus::detail::MetricsHandler::handleGet()
    @ 0x564f4bf1cd8b CivetServer::requestHandler()
    @ 0x564f4bf35e7b handle_request
    @ 0x564f4bf29534 handle_request_stat_log
    @ 0x564f4bf39b3f process_new_connection
    @ 0x564f4bf3a448 worker_thread_run
    @ 0x564f4bf3a57f worker_thread
    @ 0x7f93e9137ea7 start_thread
[mutex.cc : 1454] RAW: dying due to potential deadlock
Aborted
```

From the stack, it looks like we are ending up holding a lock to the `RlsLB` policy while removing a callback from the gRPC OpenTelemetry plugin, which is a lock ordering inversion. The correct order is `OpenTelemetry` -> `gRPC OpenTelemetry plugin` -> `gRPC Component like RLS/xDSClient`.

A common pattern we employ for metrics is for the callbacks to be unregistered when the corresponding component object is orphaned/destroyed (unreffing). Also, note that removing callbacks requires a lock in `gRPC OpenTelemetry plugin`. To avoid deadlocks, we remove the callback inside `RlsLb` from outside the critical region, but `RlsLb` owns refs to child policies which in turn hold refs to `XdsClient`. The lock ordering inversion occurred due to unreffing child policies within the critical region.

This PR is an alternative fix to this problem. Original fix in grpc#37425.

Verified that it fixes the bug.

Closes grpc#37459

COPYBARA_INTEGRATE_REVIEW=grpc#37459 from yashykt:FixDeadlocks ec7fbcf
PiperOrigin-RevId: 663360427
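The general idea of not unreffing inside the critical region can be sketched as follows. This is an illustration under assumptions, not the change made in grpc#37459: `Policy`, `Child`, `AddChild`, and `ResetChildren` are hypothetical names, and `std::shared_ptr` stands in for gRPC's `RefCountedPtr`. The owned refs are moved out of the guarded state while the lock is held and are dropped only after the lock is released, so any lock taken by a child's destructor is never acquired while the policy's lock is held.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical child object. In a real system its destructor might take
// other locks (for example to unregister a metrics callback); here it only
// bumps a counter so the destruction can be observed.
struct Child {
  explicit Child(int* destroyed) : destroyed_(destroyed) {
    assert(destroyed_ != nullptr);
  }
  ~Child() { ++*destroyed_; }
  int* destroyed_;
};

// Hypothetical policy that owns refs to children, guarded by mu_.
class Policy {
 public:
  void AddChild(std::shared_ptr<Child> child) {
    std::lock_guard<std::mutex> lock(mu_);
    children_.push_back(std::move(child));
  }

  // Drops all child refs without running any destructor under mu_.
  void ResetChildren() {
    std::vector<std::shared_ptr<Child>> to_release;
    {
      std::lock_guard<std::mutex> lock(mu_);
      // Only move the refs out while the lock is held.
      to_release.swap(children_);
    }
    // mu_ is released here. The refs are dropped, and the Child destructors
    // run, when to_release goes out of scope at the end of this function.
  }

  std::size_t size() {
    std::lock_guard<std::mutex> lock(mu_);
    return children_.size();
  }

 private:
  std::mutex mu_;
  std::vector<std::shared_ptr<Child>> children_;
};
```

The trade-off is a small amount of extra bookkeeping (a temporary container) in exchange for a guarantee that destructor side effects never nest inside the policy's own lock.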